Skip to content

Minor spec change around UNAVAILABLE / RetryInfo #669

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

DylanRussell
Copy link

It doesn't make sense that we recommend a RetryInfo with time 0, none of the implementations i've seen special case this value, and it goes against the rest of the spec which says to set it to a real integer.

It also doesn't make since to say the server "MUST" use UNAVAILABLE for retryable errors, and then go on to mention other retryable error status codes..

@dashpole
Copy link
Contributor

dashpole commented Jun 3, 2025

rest of the spec which says to set it to a real integer

Where does it say that?

@DylanRussell
Copy link
Author

DylanRussell commented Jun 3, 2025

https://github.com/open-telemetry/opentelemetry-proto/blob/main/docs/specification.md#otlpgrpc-throttling - this is the section that talks about how to use RetryInfo.

To signal backpressure when using gRPC transport, the server MUST return an error with code Unavailable and MAY supply additional details via status using RetryInfo. Here is a snippet of sample Go code to illustrate:

The only other mention of RetryInfo in the spec is:

The client SHOULD interpret RESOURCE_EXHAUSTED code as retryable only if the server signals that the recovery from resource exhaustion is possible. This is signaled by the server by returning a status containing RetryInfo. In this case the behavior of the server and the client is exactly as described in OTLP/gRPC Throttling section. If no such status is returned, then the RESOURCE_EXHAUSTED code SHOULD be treated as non-retryable.

This part of the spec I'm removing doesn't make sense.. Why must retryable errors return a RetryInfo of 0 and UNAVAILABLE ?

@DylanRussell
Copy link
Author

i think this can get merged now

Copy link
Member

@tigrannajaryan tigrannajaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking temporarily, since I am not sure this is not a breaking change.

We have removed a MUST requirement from the servers. Removing it means clients that strictly rely on this requirement may not work correctly anymore.

We must preserve the interoperability of (old,new) and (new,old) pairs of clients and servers. Please show the analysis that demonstrates this interoperability.

@pellared
Copy link
Member

@tigrannajaryan, this "MUST requirement" does not make sense and is conflicting with other parts of the specification.

This

The server MUST indicate retryable errors using code Unavailable

conficts with

The server MAY use other gRPC codes to indicate retryable [...] errors.

This

MAY supply additional details via status using RetryInfo containing 0 value of RetryDelay

seems to have only sense for RESOURCE_EXHAUSTED but it is is already described in a dedicated sentence

The client SHOULD interpret `RESOURCE_EXHAUSTED` code as retryable only if the
server signals that the recovery from resource exhaustion is possible.
This is signaled by the server by returning
[a status](https://godoc.org/google.golang.org/grpc/status#Status.WithDetails) containing
[RetryInfo](https://github.com/googleapis/googleapis/blob/6a8c7914d1b79bd832b5157a09a9332e8cbd16d4/google/rpc/error_details.proto#L40).
In this case the behavior of the server and the client is exactly as described in
[OTLP/gRPC Throttling](#otlpgrpc-throttling) section. If no such status is returned,
then the `RESOURCE_EXHAUSTED` code SHOULD be treated as non-retryable.

it is very strange to call out 0 especially given there is also an example like this:

To signal backpressure when using gRPC transport, the server MUST return an
error with code [Unavailable](https://godoc.org/google.golang.org/grpc/codes)
and MAY supply additional
[details via status](https://godoc.org/google.golang.org/grpc/status#Status.WithDetails)
using
[RetryInfo](https://github.com/googleapis/googleapis/blob/6a8c7914d1b79bd832b5157a09a9332e8cbd16d4/google/rpc/error_details.proto#L40).
Here is a snippet of sample Go code to illustrate:
```go
// Do this on the server side.
st, err := status.New(codes.Unavailable, "Server is unavailable").
WithDetails(&errdetails.RetryInfo{RetryDelay: &duration.Duration{Seconds: 30}})
if err != nil {
log.Fatal(err)
}
return st.Err()
...
// Do this on the client side.
st := status.Convert(err)
for _, detail := range st.Details() {
switch t := detail.(type) {
case *errdetails.RetryInfo:
if t.RetryDelay.Seconds > 0 || t.RetryDelay.Nanos > 0 {
// Wait before retrying.
}
}
}
```

@pellared

This comment was marked as resolved.

@DylanRussell
Copy link
Author

Yeah those suggestions SG. I've updated the PR

@tigrannajaryan
Copy link
Member

This fixes a bug in the spec and should be allowed.

@pellared
Copy link
Member

@open-telemetry/technical-committee, can someone merge it? 😉

@tigrannajaryan
Copy link
Member

Can't merge, the checks are stuck.

@tigrannajaryan
Copy link
Member

@open-telemetry/technical-committee can we have a few more eyes on this? I want to make sure we didn't miss anything.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants